Kim et a I. BMC Genomics 2013, 14:349 
http://www.bionnedcentral.conn/1471 -21 64/1 4/349 



Genomics 



METHODOLOGY ARTICLE Open Access 



NEXT-peak: a normal-exponential two-peak 
model for peak-calling in ChlP-seq data 

Nak-Kyeong Kim^" Rasika V Jayatillake^ and John L Spouge^ 



Abstract 

Background: Chromatin immunoprecipitation followed by high-throughput sequencing (ChlP-seq) can locate 
transcription factor binding sites on genomic scale. Although many models and programs are available to call 
peaks, none has dominated its competition in comparison studies. 

Results: We propose a rigorous statistical model, the normal-exponential two-peak (NEXT-peak) model, which 
parallels the physical processes generating the empirical data, and which can naturally incorporate mappability 
information. The model therefore estimates total strength of binding (even if some binding locations do not map 
uniquely into a reference genome, effectively censoring them); it also assigns an error to an estimated binding 
location. The comparison study with existing programs on real ChlP-seq datasets (STATl, NRSF, and ZNF143) 
demonstrates that the NEXT-peak model performs well both in calling peaks and locating them. The model also 
provides a goodness-of-fit test, to screen out spurious peaks and to infer multiple binding events in a region. 

Conclusions: The NEXT-peak program calls peaks on any test dataset about as accurately as any other, but 
provides unusual accuracy in the estimated location of the peaks it calls. NEXT-peak is based on rigorous statistics, 
so its model also provides a principled foundation for a more elaborate statistical analysis of ChlP-seq data. 

Keywords: ChlP-seq, Normal-exponential distribution. Continuous mixture, Poisson regression, Goodness-of-fit 



Background 

ChlP-seq experiments use chromatin immunoprecipita- 
tion and then high-throughput sequencing, primarily to 
locate transcription factor binding sites across entire ge- 
nomes, and to better our understanding of biological 
control systems [1]. As a brief overview of the relevant 
experimental protocols, they begin by irreversibly cross- 
linking a transcription factor (TF) molecule to its bind- 
ing site in genomic DNA. They then shear the DNA into 
millions of short sequence fragments. Usually, the ends 
of a fragment are near the corresponding cross-link, but 
the exact distance between the end of the fragment and 
the cross -link is random. Moreover, the fragments on 
the two DNA strands show different systematic biases in 
the positions of their ends relative to the cross -link. 
Antibodies to the TF then precipitate each TF molecule 
along with its attached fragment. Fragments are dissoci- 
ated from the TF molecules, amplified by polymerase 
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chain reaction (PGR). The fragments are then sequenced 
into short subsequences, called "tags". Computational ana- 
lysis then enters by mapping the tag sequences to a refer- 
ence genome. If a tag sequence is long enough, the tag 
matches only one genomic coordinate. Sometimes, how- 
ever, its sequence is short and maps to more than one co- 
ordinate, making the mapping ambiguous. The possibilities 
of ambiguous mapping, false positive tag-reads, and other 
experimental errors have motivated the development of 
programs to analyze ChlP-seq experiments. 

Peak-calling programs locate potential binding sites as 
"peaks" where mapped tags concentrate. Peak-calling 
programs use widely differing approaches, none of which 
has yet emerged as dominant in reviews [2-4], because 
relative accuracy of programs varies with the dataset ex- 
amined [2,3]. Improvement is probably possible, how- 
ever, because the models underlying existing programs 
do not consider mapping ambiguities directly, despite 
the existence of packages for enumerating the ambigu- 
ities, e.g., the PeakSeq suite [5]. Moreover, some pro- 
grams ignore strand-specific information [6,7]. 
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Most programs use (sometimes implicitly) kernel 
smoothing to compensate for mapping ambiguities, the 
most popular kernel being the uniform density, which is 
equivalent to counting tags in sliding windows of fixed- 
length [8-12]. Programs also manipulate information about 
a tag s strand in various ways: as mentioned, some ignore it 
[6,7]; some use it explicitly (e.g. SISSRs [12], spp [11]); and 
some use it indirectly by transposing tag locations over to 
the other strand (e.g. MACS [8], QuEST [13], CisGenome 
[9]). Programs that combine strand information and 
windowing are essentially using two separate uniform 
densities as kernels for the forward and bacl<ward strands. 
For example, QuEST [13] (among other computer pro- 
grams) estimates tag densities with a kernel smoother than 
the uniform density, to mimic the observed shape of tag 
peaks. QuEST did not dominate in comparisons, however, 
perhaps because it transposes tag locations, rather than es- 
timating two separate tag densities, one for each strand. 

The significance of peaks can be reported either as the 
number of tags in a window, a p-value, a q-value, or a pos- 
terior probability. Although p-values guide a naive user 
better than tag numbers, they introduce problematic as- 
sumptions. To derive a p-value, some programs [7,13] as- 
sume a globally uniform background intensity of tags, an 
assumption known to fail in ChIP experiments. To assign 
the significance of peaks, different programs use various 
model assumptions such as Poisson [7,12,13], local Poisson 
[8], binomial [9], and hidden Markov [10] models. Some 
programs [8,9,11-13] use control data to account for local 
variations in background tag intensity or to compensate for 
experimental artifacts like PGR over-amplification, which 
can cause a spurious concentration of tags in a few specific 
locations. The reproducibility of control data is suspect, 
however, because it varies across cell types and GhIP proto- 
col [4]. Although control data mitigates some experimental 
artifacts, its unreliability ultimately undermines any infer- 
ence based on the corresponding p-value. 

Mapping ambiguities can also be problematic for a 
naive p-value calculation. Gonsider, e.g., a large genomic 
region where exactly A locations are ambiguously 
mapped (where A is fixed). In the same region, consider 
a window containing a total of L locations, including the 
A ambiguous locations. The window length is an arbi- 
trary parameter (within limits), and as it increases, L in- 
creases. The A ambiguous locations are essentially 
censored data, so the simplest maximum likelihood esti- 
mate of the tag count in the window is L/{L-A) times the 
observed tag count. Thus, if observed tag count is fixed, 
the estimated tag count decreases as the window length 
increases. If a p-value decreases with the estimated tag 
count, it then depends on the window length. False dis- 
covery rates (FDR) depend on p-values, so the use of 
FDRs does not remove the dependency. Under the cir- 
cumstances described, therefore, the arbitrary choice of 



window length influences the number of peaks reported. 
A recent study on the number of binding sites in a gen- 
ome [14] indicates that many real binding sites from 
GhlP-seq data go unreported, suggesting that the 
assumptions and approximations underlying current 
p-value estimates leave room for improvement. 

Intuitively, the spatial resolution of a peak should also 
improve as more tags contribute to it. In principle, 
therefore, a program should also assign errors to its lo- 
cation estimates, but in fact, existing programs do not 
infer the accuracy of their estimated peak locations. 

To examine the performance of the proposed model 
(NEXT-peak) against current standards, we selected sev- 
eral programs from a recent comparison [2]: HPeak [10], 
spp package (WTD and MTG) [11], GisGenome [9], 
MAGS [8], QuEST [13], and SISSRs [12]. A summary of 
selected programs is given in Table 1. The details of 
NEXT-peak appears in the Methods. 

Results 

Fitting the NEXT-peak model to ChlP-seq data 

Using GhlP-seq datasets for three TFs: STATl [15], NRSF, 
and ZNF143 (see Methods for details), results with and 
without mappability information were examined; for each 
dataset, only the better of the two are presented here. 
Mappability information improved results for STATl, but 
degraded results for NRSF and ZNF143. 

Searches with position-specific scoring matrices from 
JASPAR [16] yielded candidates for actual STATl, NRSF, 
or ZNF143 sites within each region with a binding event. 
The searches used the p-value cut-off 5x10'^ for all 
datasets. See Methods for details on the p-value compu- 
tation for finding motif sites. Figure la shows a density 
of the normal-exponential two-peak (NEXT-peak) model 
(see Methods). Figure Ib-d displays the tag number, nor- 
malized to a probability density, for each location 
around the position of the candidate sites. The observed 
tag density is superimposed on the estimated density 
(derived from model estimates of the expected tag num- 
bers, Xf or Xj in the Methods). Maximum likelihood es- 
timation on the NEXT-peak model produced parameter 
estimates underlying X^ and . Table 2 reports esti- 
mated parameter values for each dataset. 

For STATl, the observed tag counts follow the density 
curve of the NEXT-peak model with a small difference 
in terms of average trend, except for two unexplained 
dips (Figure lb). The dips (one for right tags, and one 
for left tags) display symmetry around the binding site. 
The tag counts for NRSF (Figure Ic) also follow the 
NEXT-peak density with a small trend difference. The 
tag counts for ZNF143 (Figure Id) show a larger trend 
difference from the NEXT-peak density, perhaps because 
the GhIP experiment was noisier. 
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Table 1 Summary of programs used for comparison 
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Examples of regions with unmappable locations 

Figure 2 shows three regions with large number of 
unmappable locations from STATl data. In Figure 2, 
unmappable locations are marked by grey blocks. In 
Figure 2a, 49% of locations are unmappable; in Figure 2b, 



41%; and in Figure 2c, 38%. The circles indicate motif 
sites; the triangles, estimated sites from the NEXT-peak 
model. The estimated sites approximate the motif sites 
reasonably well. The estimated tag counts due to the 
binding event are 636.8, 264.3, and 699.3; the total 
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Figure 1 Profile of the normal-exponential two-peak (NEXT-peak) density, (a) An example of NEXT-peak density profile without fitting to a 
particular dataset. The blue curve is for a tag profile on the left (positive) strand, the red curve is for the right (negative) strand. Parameter values 
are P = 60, and a = 40 (see Methods). The two density curves mirror each other around the center location, (b) Tag profile of STATl ChlP-seq 
data. From the motif search, thousands of motif sites were found. The cumulative tag counts were rescaled and displayed as densities, (c) NRSF. 
(d) ZNF143. Table 2 reports estimated values P and a for each dataset. 
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Table 2 Summary of ChlP-seq datasets 

Dataset tag length motif length number of tags a Genome 

STATl 27 15 15.1 million 73.5 52.8 Human build 36.1 

NRSF 36 21 33.1 million 30.4 23.6 Human build 36.1 

ZNF143 36 20 25.2 million 35.3 21.6 Human build 36.1 



observed tag counts in the region are 396, 209, and 492. 
Although tags at unmappable locations are not observ- 
able, the NEXT-peak model increases the corresponding 
estimated tag counts to compensate. The compensation 
permits NEXT-peak to sharpen estimates of binding 
strength. 

Comparison of the programs: top 2,000 peaks 

To compare peak-calling programs, we run NEXT-peak 
along with other popular programs like HPeak [10], spp 
package (WTD and MTC) [11], CisGenome [9], MACS 
[8], QuEST [13], and SISSRs [12] (see Background for 
details). On running these programs including NEXT- 
peak, following standard practice, we used the default 
parameters to ensure reproducibility. As the first stage 
of comparison, we consider the 2,000 top peaks from 
each. The NEXT-peak program uses the estimated tags 
per binding event (v) to rank its peaks. We considered 
every peak called within 250 bp of a candidate site (as 
determined by position-specific scoring matrix search) 
to be a true positive (TP). Our primary performance 
measure was the number of TPs within the 2,000 highest 
peaks. (Precision provides a standard but equivalent per- 
formance measure: the precision at 2,000 positives is the 
number of TPs within the 2,000 top peaks divided by 
the constant, 2,000.) Our secondary performance mea- 
sures considered placement of TP peaks: (1) the mean 
distance between a TP peak center and the nearest motif 
site, and (2) the mean bias between a TP peak center 
and the nearest motif site. A TP peak upstream of the 



nearest motif site contributes to a negative bias; down- 
stream, a positive bias. Thus, small distances and biases 
are desirable. 

Table 3 contains summary statistics for various peak- 
calling programs. For STATl, NEXT-peak found 781 
TPs, a full 41 TPs more than any other program. For 
NRSF, NEXT-peak found 1,507 TPs, more than any 
other program; MTC was the second at 1,498 TPs. For 
ZNF143, NEXT-peak found 707 TPs, less than only 
MACS (709 TPs). For all three datasets, NEXT-peak had 
the smallest mean distances among all programs; it had 
one of the smallest biases as well. In addition, NEXT- 
peak is the only program that produced small biases for 
all three datasets. All other programs show a noticeable 
bias in at least one dataset. Specifically, HPeak and 
QuEST had a noticeable bias in STATl; WTD, MTC, 
MACS and SISSRs had a noticeable bias in NRSF and 
ZNF143; CisGenome had a noticeable bias in STATl 
and ZNF143. 

Comparison of the programs: top peaks in general 

The previous section gives performance measures based 
on the top 2,000 peaks called by each program. Re- 
searchers might wish to compare the measures based on 
lists of top peaks with different truncations, e.g., lists 
truncated at rank 1,000, 4,000, or 10,000. Figure 3 shows 
the precision (fraction of TPs among the top peaks) for 
top peaks truncated at ranks up to 10,000. For each rank 
r in the x-axis, the precision is computed for cumulative 
peaks between rank 1 and rank r. As expected, precision 
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Figure 2 A plot of regions with large number of unmappable locations from STATl ChlP-seq. Tag counts in the left strand are shown as 
blue bars, tag counts in the right strand, as red bars. The unmappable locations are marked by grey blocks. The circles represent motif sites; the 
triangles, estimated sites, (a) 49% locations are unmappable. (b) 41% unmappable. (c) 38% unmappable. 
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Table 3 Program result summary for ChlP-seq datasets 

STAT1 NRSF ZNF143 
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generally decreased with the length of the list. For 
STATl, NEXT-peak had the largest precision (of all pro- 
grams examined) over the full range of lengths, up to 
10,000 peaks. For NRSF, NEXT-peak had nearly the best 
precision up to 4,500 peaks (and in fact, it had the best 
precision at 2,000 peak; see the previous section.); 
NEXT-peak had the best precision between 4,500 and 
10,000 peaks. For ZNF143, NEXT-peak had near the 
best precision up and 10,000 peaks. For ZNF143, MACS 
performed similarly to NEXT-peak between 1,500 and 
10,000 peaks, but MACS had a significantly poorer per- 
formance between 0 and 1,500 peaks compared to other 
programs. 

Figure 4 shows mean distances for TP peaks ranked up 
to 10,000. In all three datasets, for the most of the range 
up to rank 10,000, NEXT-peak had the smallest mean dis- 
tances. Note that other programs did not show the same 
consistency among three datasets in terms of mean dis- 
tances. For example, MTC was the second best in STATl 
but performed poorly for NRSF; QuEST was the second 
best in ZNF143 but performed poorly for STATl. 

Figure 5 shows mean biases up to 10,000 peaks. As 
noted in the previous section, NEXT-peak is the only 



program showing small biases for all three datasets. Any 
other program shows a noticeable bias in at least one 
dataset. That is, for ZNF143, only NEXT-peak, QuEST, 
and HPeak had small biases, but QuEST and HPeak had 
noticeable biases in STATl, making their performances 
highly dependent on the dataset at hand. 

Correlation of estimated standard deviation and distance 
to motif site 

Unlike other programs, NEXT-peak indicates the accur- 
acy of a peak s estimated location by estimating the cor- 
responding standard deviation. Figure 6 displays smooth 
scatterplots for the estimated standard deviation versus 
distance to the nearest candidate site. (The plot is trun- 
cated on both axes at 250 bp, to dampen the influence 
of outliers caused by the omission of a true site among 
the candidate sites.) Dark regions represent high dens- 
ities of data points and small dots represent isolated 
points. Ideally, all points should fall near the line y= x. 
Most data points, however, are concentrated at the 
bottom-left corner for all three plots. That is, most 
standard deviation estimates are small and actual dis- 
tances to the motif site tend to be small as well. The 
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Figure 3 A plot of percentage of top peaks with motif, (a) STATl. (b) NRSF. (c) ZNF143. Some curves were truncated, because QuEST called 
fewer than 3,000 peaks; and WTD and MTC, fewer than 4,000 peaks. (Large percentages are desirable.) 
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Figure 4 A plot of mean distance between top pealcs and motif, (a) STATl. (b) NRSF. (c) ZNF143. Mean distances are average distances 
between motif sites and estimated sties, wliere estimated sites contain a motif site witiiin 250 bp distance. (Small distances are desirable.) 



Pearson correlation coefficients corresponding to the 
smooth scatterplots are 0.43 (STATl), 0.50 (NRSF), and 
0.33 (ZNF143), indicating that although the relationship 
is rather weak, the estimated error is positively corre- 
lated with the actual distance between the estimated 
peak location and the motif site. 

Discussion 

Alone among existing peak-calling programs, NEXT- 
peak analyzes data with a parametric statistical model. 
Of the existing programs, therefore, it alone provides a 
principled foundation for elaborating the statistical ana- 
lyses of ChlP-seq data. One obvious elaboration is to 
model multiple binding events in a region. This work is 
currently underway and the results will be reported 
elsewhere. 

NEXT-peak can estimate the average fragment length, 
even if the experiment does not measure the average 
fragment length. Let d denote the tag length, e.g., d=27 
for STATl. In the NEXT-peak model, the average 



distance from a fragment end to a cross -link is ^. The 
average distance between the fragment ends is therefore 
2^ -\- d - 1 (because the "location" of the right end is the 
leftmost position of the corresponding tag). For the 
STATl dataset, ^ =73.5 and d =27, so the NEXT-peak 
model estimate of the average fragment length is 173.0, 
consistent with a previous estimate of 174 [15]. For 
NRSF data, ^ =30.4 and d=36, the estimate of the frag- 
ment length is 95.8, and for GABP data, ^ =35.3 and 
d=36, the estimate is 105.6. 

Existing programs simply discard ambiguously mapped 
reads. In contrast, NEXT-peak explicitly models the loca- 
tions where reads do not map uniquely into the reference 
genome. NEXT-peak can therefore adjust for ambiguous 
mapping while estimating the total number of tags in a re- 
gion, thereby sharpening its estimates of TF binding 
strength. Sharper estimates of binding strength can pro- 
mote better physical interpretation of ChlP-seq results. 

Existing peak-calling programs require tedious visual 
screening of up to tens of thousands of binding regions. 
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Figure 5 A plot of mean bias between top pealcs and motif, (a) STATl. (b) NRSF. (c) ZNF143.The bias is tine (signed) distance in bp between 



an estimated site and tine nearest motif site. (Small biases are desirable.) 
only program with near-zero bias for all three datasets. 



For all datasets, NEXT-peak biases were near zero. NEXT-peak was the 
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Figure 6 Smooth scatterplots for estimated standard deviation vs. actual distance to nearest motif site, (a) STATl. (b) NRSF. (c) ZNF143. 
Both axes truncated values at 250 bp. For all datasets, the estimated standard deviation and the actual distance to the nearest motif site had a 
positive Pearson correlation coefficient. Dark regions represent high densities of data points and small dots represent isolated points. The majority 
of data points are located at the bottom-left corner for all three panels, hence most standard deviation estimates are small and actual distances 
to the motif site are also small in general. 



to eliminate experimental artifacts like spikes in tag 
numbers caused by PGR anomalies. The goodness-of-fit 
tests in NEXT-peak can reduce the burden of visual 
screening. Moreover, the same tests can detect the pres- 
ence of multiple TF binding sites, which are usually 
found in regions longer than those containing PGR 
anomalies. The long regions with small p-values can 
therefore be set aside for further, more intensive ana- 
lyses, such as searching for multiple binding events or 
sequence motifs. 

For STATl, the observed tag distribution follows the 
NEXT-peak density closely, indicating that the NEXT- 
peak model captured the essence of the physical pro- 
cesses in the GhlP-seq experiment. Gonsequently, 
NEXT-peak outperformed its competitors, possibly be- 
cause the NEXT-peak model successfully mimicked the 
true experimental kernel. On the other hand, for 
ZNF143, the observed tag distribution is somewhat devi- 
ated from the NEXT-peak density, possibly degrading 
NEXT-peaks performance slightly. The observed tag 
density might reflect a mixture of multiple binding 
events, however, resulting from TF binding fluctuating 
between different protein complexes. Mass and struc- 
tural differences between the protein complexes could 
cause binding locations or mean fragment lengths to 
fluctuate. Gonventional motif analysis or a more elabor- 
ate model including multiple binding sites might expose 
the protein-protein interactions, however. 

By adding mappability information, STATl increased 
true-positive binding sites by 4.0% on average. Unlike 
STATl, mappability information for NRSF and ZNF143 
actually degraded the performance of NEXT-peak: on 
average, it decreased true-positives by 0.5% and 0.6%, a 
surprising result given that both NRSF and ZNF143 had 
large numbers of mapped tags (33.1 millions and 25.2 



millions). The truncated read length for NRSF and 
ZNF143 was 36, however, much larger than read length of 
27 for STATl. Thus, fewer genomic locations were 
mapped ambiguously for NRSF or ZNF143 (10.3%) than 
for STATl (16.2%), diminishing NEXT-peaks ability to en- 
hance its performance by adding mappability information. 

This article examined three GhlP-seq datasets with a 
single dominant binding motif, permitting motif sites to 
serve as surrogates for the true binding sites. In general, 
however, even with an antibody specific to a protein, 
protein-protein interactions between TF molecules 
might cause multiple TFs (and hence, multiple motifs) 
to cross-link to an antibody. The two global parameters 
a (the standard deviation for the cross-link locations) 
and P (the intensity of the Poisson process modeling 
shearing) then require delicate estimation. One could se- 
lect a few hundred of the most tag-rich regions. One 
could screen the regions visually, choosing the ones with 
a good fit to the dual normal-exponential density and 
then estimate a and ^. Alternatively, one could perform 
a motif search on the tag-rich regions. The observed tag 
density for each motif then can be fit to the NEXT-peak 
model. Thus, NEXT-peak can analyze any GhlP-seq ex- 
periment, even without specific information on the pro- 
tein interactions. 

Conclusions 

We proposed a new statistical model for identifying 
binding sites from GhlP-seq data. The model success- 
fully mimics the underlying data-generating process in 
GhlP-seq experiments by using the dual density of a 
normal-exponential two-peak model. The NEXT-peak 
program produced better prediction with more true pos- 
itives and a better spatial resolution than any other pro- 
gram tested. The NEXT-peak program tests the validity 
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of its underlying NEXT-peak model without depending (as 
many programs do) on an unrealistic assumption of a glo- 
bal uniform background tag distribution. The NEXT-peak 
program stands alone in quantifying errors by reporting a 
standard error for its estimates of binding intensity. More- 
over, smooth scatterplots showed that its standard errors 
are informative about errors in motif location, as estimated 
from external standards. The NEXT-peak program also 
provides a goodness-of-fit test, automating screening of the 
spurious binding, and work is in progress to extend its 
model to locate multiple binding events in a region. 

Methods 

ChlP-seq datasets 

Our analysis used ChlP-seq datasets corresponding to 
three TPs: STATl, NRSF, and ZNF143. Because the three 
datasets correspond to known binding motifs, they pro- 
vide a gold-standard for evaluating peak-calling pro- 
grams [2]. Table 2 presents summary statistics for the 
three datasets. 

The STATl dataset [15] was downloaded at http://www. 
bcgsc.ca/downloads/chiptf/human/STATl/stimulated/july_ 
23_2008/. The NRSF[SRA:SRR577995] and ZNF143[SRA: 
SRR243553] datasets were downloaded from the SRA 
database at http://www.ncbi.nlm.nih.gov/sra. Bowtie [17] 
mapped tags into a reference human genome (NCBI Build 
36.1) for all three datasets. Mismatches of up to 2 bases were 
permitted, if they mapped uniquely within the genome. 

For all datasets, the PeakSeq suite [5] then determined 
whether tag sequences in the reference genome were 
unique. PeakSeq requires tags of a uniform length. The 
tags for the downloaded STATl dataset, however, had 
varying length although most tags had length 27. We 
truncated the tags to length 27, if they were longer, or 
we discarded them, if they were shorter. Thus, we used 
the mappability information for 27 bp tags to approxi- 
mate the complete STATl data. The downloaded NRSF 
had tags of length 50 and the downloaded ZNF143 had 
tags of length 40. We truncated tags from both datasets 
to length 36 to make it easy to investigate the effect of 
the mappability information (36 bp mappability informa- 
tion was used). Additional file 1 also reports on results 
from three additional datasets, MAX, GABP, and FoxAl. 

Notations for the ChlP-seq data 

Let some preliminary method (see NEXT-peak algorithm 
for detail) flag possible cross-links in candidate genomic 
regions Rr (r = 1, S), Computational time is a consid- 
eration, because S can be on order of 10^ or more. Con- 
sider a specific genomic region Rg, where 5 g {1, .S}, 
and let Rg have width Wg, with the nucleotide bases hav- 
ing coordinates 1, Wg. Call the forward and backward 
DNA strands "left" and "right", so the bases at each loca- 
tion ; on the left and right strands are complementary 



(/ = 1, Wg). Let denote the set of locations within 
Rg where a tag sequence is not unique within the gen- 
ome, so the corresponding tag maps ambiguously. 

The superscripts "L" and "R" distinguish quantities 
pertaining to the left and right strands: note in particular 
that "L" and "R" are not exponents. Within Rg, let a total 
of "left tags" be observed on the left strand; "right 
tags", on the right strand. Define the "location" of a tag 
as its leftmost position. Let the location of the left tags 
map to e{l, (/ = 1, f?^); the location of the 
right tags, to g{1, ...,1^5} (/ = 1, n^). Thus, the data 
are X = {x[^ x^^^, X*^) where no location 

or xf is in X^ 

An alternative representation is occasionally useful. 
Let yf^{0, 1,2, ...} be the number of left tags observed 
at location ; e {1, Wg}; y^, the number of right tags ob- 
served at position Tags cannot be observed at locations 
;gX^ Thus, {y{, ^^.,)i,y^, ...,y^,X^) provides an equiva- 
lent representation of the data X. The model parameters 
for the data in Rg are {/Ug, Vg, pg) and (0^, yS), defined 
below. Parameters specific to the region Rg are 
subscripted with "5"; the parameters common to all gen- 
omic regions lack subscripts. 

Dual density for a binding event 

Let the standard normal distribution have the density 
function 0(») and the cumulative distribution function 
0{*), Each observed right tag location xf in Rg corre- 
sponds to an underlying (and unobservable) random 
variate f/, the coordinate of the corresponding cross-link. 
For mathematical convenience, assume ~ N{iig, cr^), i.e., 
is a normal variate with mean [ig (specific to R^ and 
variance (common to i?^ for r = 1, 5'). The density 
of the distribution of is 

Assume that upon shearing, the breaks in the DNA 
form a Poisson process over the entire genome. Thus, 
the break point at location 0^ (xf > ^i) corresponds to 
an exponential random variate xf-^i. For simplicity, as- 
sume that the mean /3 of the exponential distribution is 
common to all regions R^ (r = 1, 5'), so the density 
function corresponding to xf is 

;r(^|f,,^)=iexp(-^) 

for xf > f and 0 otherwise. 

The joint density of (xf^ ^i) is therefore 
n{xf,^i\f4^,a\^)=n{xf\^i,^)-n{^iKa^). Integrate f,- 
out, to derive the density function of xf: 



Kim et a I. BMC Genomics 2013, 14:349 
http://www.bionnedcentral.conn/1471 -21 64/1 4/349 



Page 9 of 1 2 



exp|-^[:>cf- 



The above distribution is a marginal distribution of 
the normal- exponential joint density, which we call a 
"normal-exponential" distribution in short. Figure 7 
shows a schematic representation for the role of parame- 
ters a and in a ChlP-seq experiment. Figure la shows 
a normal-exponential density for both left and right tags 
(as discussed in Results). 

Poisson regression model for the observed tags 

Within Rgi (1) let Vg be the expected number of right tags 
for each TF molecule that binds; (2) let ps be the uni- 
form background intensity of right tags; and (3) let be 
the expected number of right tags at location Assume 

Approximate the sum over all locations ; with an integra- 
tion, to produce a consistency check: = J •/^(;v|6)<ix 



The expected total number of right tags within due to 
binding is therefore approximately v^, = 0 being equiva- 
lent to the absence of TF binding in R^, On the other hand, 
the expected total number of right tags within Rg due to 
noise is WsPs- 

Let be the random variable that counts right tags 
at and assume Yf has a Poisson distribution with mean 
i.e.. 



?r{Yf=y) = exp(-A;) 



forje {0,1,2,...}. 

The models for the tag locations {cvf} and {;«f} on 

the left and right strands share the parameters (^5, o^, ^, 
V5, ps) and differ only in mirroring the tag location 



a = standard deviation 
for length from center of TF 
to cross-linl<s 



P = average length 
from a cross-link to a tag 




{Xi^} = left tag locations {xi^} = right tag locations 

Figure 7 ChlP-seq experiment witii NEXT-pealc model parameters. A genomic location of tine center of tlie TF is denoted as jj. Green bi- 
directional arrows represent cross-links between a TF and a genomic sequence. Cross-links are assumed to be normally distributed with standard 
deviation o. Tags are shown as small black rectangles at the 5' end of fragments. The distance from a cross-link to a tag location is assumed to 
be exponentially distributed with mean p. When tags are mapped to a reference genome, then tags are projected onto the corresponding 
genomic locations. Blue arrows represent "left" tags mapped on the forward strand; red arrows, "right" tags mapped on the backward strand. The 
tag distribution is the NEXT-peak under the previous assumptions. 
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densities around the link location i.e., 

The models for the left and right tags share and ps, 
so as in the model for the right tags, is the expected 
number of left tags for each TF molecule that binds, and 
Ps is the uniform background intensity of left tags. 

Figure la shows a normal-exponential two-peal< (NEXT- 
peal<) density. In this example, = 0, = 60, and o = 40. The 
two density curves mirror each other around the center loca- 
tion = 0. The curves are asymmetrical distributions skewed 
in opposite directions. Although both tails of each density 
rapidly approach zero, one tail approaches zero much faster 
than the other. Figure la motivates the model name: the 
"normal-exponential two-peal<" (NEXT-peal<) model. 

Let 65 = {[isi Vg, Ps)' Under the NEXT-peak model in 
Figure la, the lil<elihood of the data {y{, .,,,y^,yf, ••♦,3^ , 
mRs is 

L{a,^,Qs) = Hy^xo [Pr(rf =3f |a,^,e,).Pr(rf =yf|a,^,e,)]. 

Thus, the likelihood of the complete dataset is 

i((T,^,e) = ntii(a,^,e,) 

where 6 = (6^ : r = 1, S), 

Because a and ^ do not depend on the region R^, train- 
ing data can yield maximum likelihood estimates (MLEs) 
a and . Fix the values a = a and ^ = ^ > so the 
remaining parameters requiring estimation for the 
NEXT-peak model are 6 = (61, 65). Maximization of 

the profile likelihood 1(65) = L(^a,^,fi^,Vs,p^^ then yields 

the estimate 65 = [fi^^Vs^p^ within each region R^. The 
estimates components are: (1) the estimated mean location 
jk^ of a binding event, (2) the estimated mean total number 
Vs of right (or left) tags within R^ due to the binding event, 
and (3) the estimated uniform background intensity p^ of 
right (or left) tags within 7?^. As usual, the inverse of the 
Fisher Information matrix (i.e., the inverse of the negative 
expectation of the Hessian of the log-likelihood) estimates 
the asymptotic variance-covariance matrix for 65. 

Let hs = (jig^O^p^^ denote the maximum likelihood 

estimate (MLE) of 6^ = (/z^, v^, ps) under the restriction 
= 0. To test whether or not a binding event occurred in 
Rs, under the null hypothesis HqIVs = 0, the likelihood ra- 
tio (LR) statistic 

A = -2log[l(i)/l(e,)] 

has an asymptotic chi-square distribution with 1 degree 
of freedom. Unlike the null hypotheses in many existing 
programs, the NEXT-peak model does not assume a glo- 
bally uniform background intensity. Instead, its locally 



uniform background intensity is equivalent to assuming 
that the expected number of background tags per loca- 
tion varies slowly enough to be almost constant within 
each region R^ {r = 1, S). 

Goodness-of-fit test 

The following can test whether observed tag data are 
consistent with the NEXT-peak model. PGR over- 
amplification and other experimental artifacts can cause 
spikes in the observed tag distribution. Likewise, mul- 
tiple binding events in a region R^ cause deviations from 
the unimodal density of the model. Accordingly, con- 
sider Pr (D\Hq), the probability of the data D under the 
null hypothesis Hq of a NEXT-peak model. Under Hq, 
the counts of right tags at location J are generated 

with the Poisson intensity given above, the counts of 
the left tags being generated with the mirror intensity . 
Consider also the probability Pr (D\Hi) of the data 
under an alternative model Hi where the underlying in- 
tensity is unrestricted. The LR statistic 2log[Pr(D| 
//i)/Pr(D|//o)] for the models follows a distribution, 
with degrees of freedom equal to the difference of the 
number of parameters in Hq and Hi, The LR test yields 
a p-value for each region R^y with a small p-value indi- 
cating a poor fit within R^ to the NEXT-peak model 
underlying Hq, 

P-value computation for finding motif sites 

From a position-specific score matrix, any segment re- 
ceives a score by adding the corresponding columns 
scores from the position-specific score matrix. The prob- 
ability of observing the score or higher is computed 
based on the null model that nucleotides (A, C, G, and 
T) can appear at random with equal probabilities. We 
used the Stadens method [18] for the convolution com- 
putation of the score distribution. The principal of our 
p-value computation has been used for PSI-BLAST [19] 
and A-GLAM [20], among others, for their p-value com- 
putation, where p-values are used to compute E-values. 

Screening based on goodness-of-fit tests 

Like many peak-calling programs, NEXT-peak masks loca- 
tions having anomalously many tags, but it does so in a 
principled manner, with p-values based on goodness-of-fit 
tests. Long regions with small p-values suggest multiple 
binding sites, so for all datasets NEXT-peak masked regions 
less than a certain length long and with small p-values. 
When motif site locations are available, based on the area 
under the precision curve up to the top 10,000 peaks (e.g. 
Figure 3), the NEXT-peak program automatically reports a 
cut-off recommendation for each dataset. The Results sec- 
tion uses the length 400 and the p-value 10'^ as a cut-off 
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for STATl and the length 300 and the p-value 10'^ as a 
cut-off for NRSF and the length 400 and the p-value 10'^ as 
a cut-off for ZNF143, all suggested by the NEXT-peak 
output. 

NEXT-peak algorithm 

The NEXT-peak program goes through the following 
procedures for producing an output (Figure 8 shows a 
flowchart for the NEXT-peak program). (1) Read the 
mapped tag location file, e.g., from a Bowtie [17] output. 
(2) Select regions based on the tag count with a user 
specified length of the window (default: 150) and a user 
specified minimum count (default: 15). When a neigh- 
boring window has more than the minimum count, the 
window under scrutiny is combined with its neighbor. 
The region lengths range from the minimum length (de- 
fault: 150) to several thousands. (3) When motif site lo- 
cations are available, estimate a and ^ by maximizing the 
likelihood using motif site locations. For a known TF, 
a publicly available motif pattern is used, e.g. from 



JASPAR [16]. For an unknown motif, run the NEXT- 
peak program with default values (cr = 30 and ^ = 50), 
and identify the strongest motif from a motif search. Al- 
ternatively, a user can estimate these parameters with se- 
lected regions. (4) For each region, estimate fi and v by 
maximizing the likelihood. It computes the standard de- 
viation estimates for these estimates. Then, perform a 
goodness-of-fit test for each region. (5) As a post- 
processing step, compute the region length and p- 
value cut-off recommendations to screen out potential 
spurious regions when the motif site locations are 
available. 



NEXT-peak software 

The NEXT-peak program is implemented in C++. 
The code is publicly available at http://www.odu. 
edu/'-nxkim/nextpeak/. A typical computation time is 
about 15-45 minutes, depending on the size of the in- 
put data. 




Estinnate a and P 
using anchored data 



Estinnate \i and v 
for each region 



Use default a and P 
Estinnate |i and v for each region 
Identify strongest motif with 
regions having top peaks 



Compute cutoff for 
spurious regions 




Figure 8 A flowchart for NEXT-peak algorithm. First, the program reads the mapped tags. Then, it selects regions based on the tag count 
with a user specified length of the window (default: 150) and a user specified minimum count (default: 15). If motif site locations are provided, it 
estimates o and P using motif site locations. For an unknown motif, run the NEXT-peak program with default values (a = 30 and /3 = 50), and 
identify the strongest motif from a motif search. For each region, the program estimates jj and v. It computes the standard deviation estimates 
for these estimates. As a post-processing step, the program computes the region length and p-value cut-off recommendations to screen out 
potential spurious regions when the motif site information is available. 
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Additional file 



Additional file 1: Supplementary material. Supplementary material 
contains results for additional datasets, MAX, GABP, and FoxAl. This file 
contains four supplementary figures and two supplementary tables: 
Figure SI. A plot of percentage of top peaks with motif Table SI reports 
estimated values |8 and 6 for each dataset. Figure S2. A plot of 
percentage of top peaks with motif Some curves were truncated in (a), 
because QuEST called fewer than 5,000 peaks; MTC, fewer than 7,000 
peaks; and WTD, fewer than 9,000 peaks. In (b), QuEST, WTD, and MTC 
called fewer than 4,000 peaks. Figure S3. A plot for mean distance 
between top peaks and motif Mean distances are average distances 
between motif sites and estimated sites, where estimated sites contain a 
motif site within 250 bp distance. Figure S4. A plot of mean bias between 
top peaks and motif The bias is the (signed) distance in bp between an 
estimated site and the nearest motif site. Table SI. Summary of additional 
ChlP-seq datasets. Basic characteristics of additional datasets. Table S2. 
Program result summary for additional ChlP-seq datasets. 
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